consensus/bor, internal/cli: full grpc implementation #2194
marcello33 wants to merge 17 commits into develop from
Conversation
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
@claude review.
Claude finished @marcello33's task in 3m 47s. Code Review: gRPC Implementation
Overall this is a solid WIP. The auth interceptor, API caching, and header expansion are well-designed. I found one security-sensitive bug and a few smaller items.

Finding 1 — uint64 overflow in
Pull request overview
This WIP PR extends the gRPC implementation by adding bearer-token authentication and expanding the Bor gRPC API surface, while aligning Bor’s RPC API lifecycle with gRPC’s per-call access pattern.
Changes:
- Add optional gRPC bearer-token auth + unified unary/stream interceptors (also gating reflection).
- Expand BorApi gRPC handlers and improve header→proto conversion + batch endpoint size gating.
- Cache Bor's `*API` instance across `APIs()` calls to preserve per-API state; update deps/tests accordingly.
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/cli/server/server.go | Add combined unary/stream interceptors and metadata-based bearer auth. |
| internal/cli/server/config.go | Add GRPC.Token config field. |
| internal/cli/server/flags.go | Add --grpc.token CLI flag. |
| internal/cli/server/grpc_auth_test.go | Add unit tests for auth + interceptors. |
| internal/cli/server/api_service.go | Add/extend BorApi gRPC handlers, header conversion, and batch endpoint with size gate. |
| internal/cli/server/api_service_test.go | Add tests for header round-trip and batch-range guards. |
| consensus/bor/bor.go | Cache the Bor *API returned by APIs() across calls. |
| consensus/bor/bor_test.go | Add test asserting APIs() returns the same service instance across calls. |
| go.mod / go.sum | Bump github.com/0xPolygon/polyproto to a newer pseudo-version. |
| cmd/keeper/go.mod / cmd/keeper/go.sum | Bump indirect golang.org/x/sys. |
| eth/tracers/data.csv | Reorder/update CSV rows. |
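The `consensus/bor/bor.go` row above caches the `*API` returned by `APIs()` so every caller shares one instance. A minimal sketch of that `sync.Once` pattern — the `API` and `Bor` fields here are stand-ins for illustration, not Bor's real types:

```go
package main

import (
	"fmt"
	"sync"
)

// API is a stand-in for the Bor RPC API type; the point is that it
// carries per-instance state (here, a cache) both transports must share.
type API struct {
	rootHashCache map[string]string
}

// Bor is a stand-in consensus engine holding the cached API.
type Bor struct {
	apiOnce sync.Once
	api     *API
}

// APIs returns the same *API on every call, so a gRPC backend and a
// JSON-RPC registration path end up sharing one rootHashCache.
func (b *Bor) APIs() *API {
	b.apiOnce.Do(func() {
		b.api = &API{rootHashCache: make(map[string]string)}
	})
	return b.api
}

func main() {
	b := &Bor{}
	fmt.Println(b.APIs() == b.APIs()) // same instance across calls: true
}
```

This is the property the new `bor_test.go` test asserts: repeated `APIs()` calls yield the same service instance.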
Code Review

Bug: uint64 overflow causes infinite loop in `GetBlockInfoInBatch` (DoS)

File: `internal/cli/server/api_service.go` L197-L218

The for loop iterates on the block number itself, so when `EndBlockNumber == math.MaxUint64` the condition `i <= req.EndBlockNumber` always holds: after the last block, `i` wraps to 0 and the loop never terminates. The size gate does not prevent this. For example, `StartBlockNumber = MaxUint64, EndBlockNumber = MaxUint64` has a difference of 0, which passes the `>= 256` check.

Why `fetchBlockInfo` doesn't save us: `rpc.BlockNumber(math.MaxUint64)` casts to `int64(-1)`, which is `PendingBlockNumber`, a valid block on mining/validator nodes. After wrapping to 0, the loop fetches from genesis through every block in the chain, causing unbounded memory allocation and CPU consumption.

Suggested fix: restructure the loop to count iterations instead of iterating on the block number directly:

```go
count := req.EndBlockNumber - req.StartBlockNumber + 1
for j := uint64(0); j < count; j++ {
	info, ok := s.fetchBlockInfo(ctx, req.StartBlockNumber+j)
	if !ok {
		break
	}
	out.Blocks = append(out.Blocks, info)
}
```

This avoids the `i <= MaxUint64` trap since `j < count`, where `count` is bounded by the size gate.

Additional concern: `fetchBlockInfo` casts uint64 block numbers to `rpc.BlockNumber` (int64) without checking for overflow. Large uint64 values silently map to negative int64 values corresponding to special block numbers (-1 = Pending, -2 = Latest), which could return unexpected blocks.
Codecov Report

❌ Patch coverage is 77.64%. Your patch check has failed because the patch coverage (77.64%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@ Coverage Diff @@
## develop #2194 +/- ##
===========================================
+ Coverage 52.29% 52.51% +0.21%
===========================================
Files 884 884
Lines 155571 155947 +376
===========================================
+ Hits 81355 81892 +537
+ Misses 68989 68803 -186
- Partials 5227 5252 +25
... and 22 files with indirect coverage changes
@claude addressed your points. Review again.
Claude finished @marcello33's task in 2m 4s. Re-review: gRPC Implementation (post
Pull request overview
Copilot reviewed 13 out of 15 changed files in this pull request and generated 4 comments.
@claude addressed your points from the previous run.
Claude finished @marcello33's task in 1m 53s. Re-review: gRPC Implementation (3rd pass — post
@claude addressed latest comments. Review.
Claude finished @marcello33's task in 1m 16s. Re-review: gRPC Implementation (4th pass — post
Code review: No issues found. Checked for bugs and CLAUDE.md compliance.
Pull request overview
Copilot reviewed 13 out of 15 changed files in this pull request and generated 1 comment.
Looks good in terms of implementation! The patch coverage is at 53%; could we bump it up? Also, the previous diffguard run failed. I restarted it; let's see if it finds something that could be simplified.
Sure, @cffls, thanks.
… cmd/keeper go.sum
mapBorAPIError now translates the two GetRootHash sentinels that previously
fell through to codes.Internal:
- "reorg occurred while computing checkpoint root" -> codes.Aborted
(transient/retriable; clients can retry once the chain settles)
- "non-contiguous headers in checkpoint range" -> codes.DataLoss
(DB inconsistency / ancient pruning; conveys unrecoverable server state
without the "this is a bug" implication of Internal)
Also runs go mod tidy in cmd/keeper to add the missing
golang.org/x/sys v0.42.0 h1: source hash that the previous bump left out.
@cffls |
```go
// readConfigFile may leave config.GRPC nil when the HCL/TOML config
// omits the [grpc] block. Treat that as "gRPC disabled" instead of
// dereferencing into a panic at startup.
if config.GRPC == nil {
	log.Info("gRPC server disabled (no [grpc] block in config)")
	return nil
}

addr := config.GRPC.Addr
if addr == "" {
	log.Info("gRPC server disabled (grpc.addr is empty)")
	return nil
}

// Mirror heimdall's client-side posture: warn (don't block) when the
// operator opts into a non-loopback bind without a bearer token. The
// startup log entry is what operators are expected to act on.
if config.GRPC.Token == "" && !IsLoopbackHostPort(addr) {
	log.Warn(
		"Starting unauthenticated gRPC server on non-loopback address; "+
			"set --grpc.token / BOR_GRPC_TOKEN to require authentication.",
		"addr", addr,
	)
}

return srv.gRPCServerByAddress(addr)
```
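The `IsLoopbackHostPort` helper referenced above is not shown in this diff; a plausible stdlib-only sketch of such a check — an assumption about its behavior, not the PR's actual implementation:

```go
package main

import (
	"fmt"
	"net"
)

// isLoopbackHostPort reports whether addr ("host:port") binds to a
// loopback interface, treating the name "localhost" as loopback.
func isLoopbackHostPort(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false // malformed addr: fail closed, i.e. warn the operator
	}
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback() // covers 127.0.0.0/8 and ::1
}

func main() {
	fmt.Println(isLoopbackHostPort("127.0.0.1:3131")) // true
	fmt.Println(isLoopbackHostPort("0.0.0.0:3131"))   // false
}
```

Failing closed on unparseable addresses means the "unauthenticated on non-loopback" warning fires rather than being silently skipped.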
```go
	Default: c.cliConfig.GRPC.Addr,
})
f.StringFlag(&flagset.StringFlag{
	Name:    "grpc.token",
	Usage:   "Raw token expected in the `authorization: Bearer <token>` header of incoming gRPC calls (empty disables auth; the `Bearer ` prefix is stripped before comparison). Prefer the BOR_GRPC_TOKEN environment variable over this flag.",
	Value:   &c.cliConfig.GRPC.Token,
	Default: c.cliConfig.GRPC.Token,
})
```
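The token handling this flag describes — case-insensitive `Bearer` scheme per RFC 6750 §2.1, prefix stripped before comparison — can be sketched as follows. The function names are hypothetical and the constant-time comparison is an assumption about good practice, not necessarily what the PR's interceptor does:

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// bearerToken extracts the token from an "authorization" metadata value,
// accepting the scheme case-insensitively ("Bearer", "bearer", "BEARER").
func bearerToken(header string) (string, bool) {
	const prefix = "bearer "
	if len(header) < len(prefix) || !strings.EqualFold(header[:len(prefix)], prefix) {
		return "", false
	}
	return header[len(prefix):], true
}

// tokenMatches compares in constant time so response timing does not
// leak how many leading bytes of the token were correct.
func tokenMatches(got, want string) bool {
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}

func main() {
	tok, ok := bearerToken("Bearer s3cr3t")
	fmt.Println(ok, tokenMatches(tok, "s3cr3t")) // true true
}
```

A unary interceptor would pull the `authorization` value from incoming metadata, run these two checks, and return an `Unauthenticated`-style error on failure.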
🟣 Pre-existing — the --grpc.addr deref at flags.go:1231 has had this defect since the [grpc] block was first added — but this PR is the natural place to land the symmetric fix because it adds a second deref at line 1237 (&c.cliConfig.GRPC.Token) and the author already established a nil-guard for the same scenario in WithGRPCAddress (server.go:88-91). When an operator supplies an HCL config without a [grpc] block, readConfigFile (config.go:1161-1165) only initializes TxPool/Cache/Sealer — leaving GRPC nil — and c.Flags(cfg) panics during command-line parsing, before WithGRPCAddress ever runs. The TOML loader is safe because readLegacyConfig starts from *DefaultConfig(). One-line fix: lazy-init if c.cliConfig.GRPC == nil { c.cliConfig.GRPC = &GRPCConfig{} } at the top of Flags(), or initialize GRPC inside readConfigFile alongside TxPool/Cache/Sealer.
Extended reasoning...
What the bug is
internal/cli/server/flags.go:1227-1239 registers the grpc.addr and (PR-added) grpc.token flags by binding their Value and Default fields to addresses inside c.cliConfig.GRPC:
```go
f.StringFlag(&flagset.StringFlag{
	Name:    "grpc.addr",
	Value:   &c.cliConfig.GRPC.Addr, // line 1231 — pre-existing
	Default: c.cliConfig.GRPC.Addr,
})
f.StringFlag(&flagset.StringFlag{
	Name:    "grpc.token",
	Value:   &c.cliConfig.GRPC.Token, // line 1237 — added by this PR
	Default: c.cliConfig.GRPC.Token,
})
```

Both expressions panic with a nil-pointer dereference if `c.cliConfig.GRPC == nil`.
How c.cliConfig.GRPC ends up nil
The HCL config loader at internal/cli/server/config.go:1155-1180:
```go
config := &Config{
	TxPool: &TxPoolConfig{},
	Cache:  &CacheConfig{},
	Sealer: &SealerConfig{},
}
if err := hclsimple.DecodeFile(path, nil, config); err != nil { ... }
```

Only TxPool, Cache, and Sealer are pre-initialized. `Config.GRPC` is declared as `*GRPCConfig` with the HCL tag `hcl:"grpc,block"` (config.go:1153), and `hclsimple.DecodeFile` does not allocate missing block-pointer fields — it leaves them nil. So an HCL config that omits the `[grpc]` block produces `cfg.GRPC == nil`.
The TOML path is safe because readLegacyConfig (config_legacy.go:18) starts from *DefaultConfig(), which initializes GRPC: &GRPCConfig{Addr: "127.0.0.1:3131"}. This is HCL-only.
Why the PR's new nil-guard does not save us
The PR added if config.GRPC == nil { return nil } at server.go:88-91 inside WithGRPCAddress. That guard correctly handles programmatically-constructed configs (tests, embedders), but WithGRPCAddress runs from NewServer (command.go:234) — much later than extractFlags. By the time it would protect us, Flags() has already crashed the process during command-line parsing.
The author also has a nil-check at command.go:138 (if c.cliConfig.GRPC != nil) for the env-var fallback, which acknowledges the nil-GRPC scenario — but again, that runs after c.Flags(cfg).
Step-by-step proof
- Operator writes a minimal `config.hcl` that omits the `[grpc]` block.
- `bor server -config config.hcl` invokes `Run() → extractFlags(args)`.
- `command.go:91`: `cfg, _ := readConfigFile("config.hcl")` — `cfg.GRPC == nil`.
- `command.go:101`: `flags := c.Flags(cfg)`.
- `flags.go:8-9`: `c.cliConfig = config` — nil-GRPC propagates.
- `flags.go:1231`: `&c.cliConfig.GRPC.Addr` — nil pointer dereference.
- Process panics during command-line parsing, before any guard in `WithGRPCAddress` can run.
Severity rationale
pre_existing — the grpc.addr deref at line 1231 has had this defect since the [grpc] block was added; the PR does not introduce the panic, only doubles the access count by adding grpc.token. HCL configs are uncommon in bor deployments (the repo ships only TOML examples and docs/cli/default_config.toml is TOML), so reachability is narrow. However, this PR is the natural place to land the fix because the author already established the parallel nil-guard pattern in WithGRPCAddress for the exact same scenario.
Suggested fix
Either of the following lands the fix:
- At the top of `Flags()` (smallest, most defensive): `if c.cliConfig.GRPC == nil { c.cliConfig.GRPC = &GRPCConfig{} }`
- In `readConfigFile` (treats the HCL path symmetrically with the TOML path): `config := &Config{ TxPool: &TxPoolConfig{}, Cache: &CacheConfig{}, Sealer: &SealerConfig{}, GRPC: &GRPCConfig{} }`
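The first option can be shown in isolation. This is a standalone sketch: the types are trimmed to the fields involved, and `Flags` is reduced to the part that matters for the nil-guard:

```go
package main

import "fmt"

// Trimmed stand-ins for the real config/command types.
type GRPCConfig struct {
	Addr  string
	Token string
}

type Config struct {
	GRPC *GRPCConfig
}

type Command struct {
	cliConfig *Config
}

// Flags lazily allocates GRPC before any &c.cliConfig.GRPC.Addr /
// &c.cliConfig.GRPC.Token binding can dereference a nil pointer.
func (c *Command) Flags(cfg *Config) {
	c.cliConfig = cfg
	if c.cliConfig.GRPC == nil {
		c.cliConfig.GRPC = &GRPCConfig{}
	}
	// Flag registration can now safely bind to these addresses.
	_ = &c.cliConfig.GRPC.Addr
	_ = &c.cliConfig.GRPC.Token
}

func main() {
	c := &Command{}
	c.Flags(&Config{}) // GRPC omitted, as with an HCL config without [grpc]
	fmt.Println(c.cliConfig.GRPC != nil) // true: lazily allocated
}
```

An empty `GRPCConfig` also preserves the existing "empty addr means disabled" semantics downstream, so the lazy-init does not silently enable the server.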
Summary
In this PR we implement the server-side half of the heimdall-bor gRPC migration started by polyproto's `GetBlockInfoInBatch`, `GetTdByHash`, `GetTdByNumber`, `GetAuthor` messages (see this PR). It provides:

Results
`GetBorChainBlockInfoInBatch` with 77% latency reduction — the call used in the consensus-critical `ExtendVoteHandler` path.

Latency (ns per op)
Allocations (bytes and count per op)
Ambient system-level resource usage (per phase, 5 s interval across 1000 blocks)
CPU samples during bench window (pprof, 120 s each, aligned to bench run)
Heimdall CPU is ~17% higher on gRPC — consistent with the client doing proto decode per call instead of JSON-RPC decode.
Details
- New handlers: `GetBlockInfoInBatch`, `GetTdByHash`, `GetTdByNumber`, `GetAuthor`.
- `Header` fidelity: `headerToProtoBorHeader` now emits every 21 header fields.
- Auth: `--grpc.token` flag (or `BOR_GRPC_TOKEN` env var) + unary/stream interceptors. If the token is empty, auth is disabled. This matches the RFC 6750 §2.1 case-insensitive scheme.
- `Bor.APIs()` caches a single `*API` instance via `sync.Once`, so the gRPC backend shares the same `rootHashCache` as the JSON-RPC registration path. Without this, I detected a 2.1× `GetRootHash` regression.
- `GetBlockInfoInBatch` safety: size gate (≤ 256 blocks), `end > MaxInt64` guard, count-based loop (wrap-safe), per-iteration `ctx.Err()` check.
- `fetchBlockInfo` returns `(info, ok, err)` so real failures propagate instead of silently truncating batches.
- Error mapping: `InvalidArgument` for malformed requests, `NotFound` for missing blocks/TD, `OutOfRange` for numeric-overflow cases. Callers can branch on canonical codes instead of everything collapsing to `codes.Unknown`.

What this enables on the heimdall side
- Parity for the existing calls (`GetRootHash`, `GetVoteOnHash`, `HeaderByNumber`, `BlockByNumber`, `TransactionReceipt`, `BorBlockReceipt`).
- The milestone path (`x/milestone/abci`) can issue a single `GetBlockInfoInBatch` round-trip instead of 3×N JSON-RPC batch elements — up to 4.4× faster per call and 66% fewer bytes allocated per request on the kurtosis devnet benchmark.

Info about the gRPC authentication
gRPC authentication via bearer token is not strictly needed when bor and heimdall run on the same host with bor bound to 127.0.0.1:3131 (the post-PR-#2078 default). This is because:
The token becomes meaningful in three cases:
Enabling it
gRPC is opt-in. Default config has bor listening on `127.0.0.1:3131` and heimdall dialing over HTTP JSON-RPC — no behavior change for existing operators.

Step 1: Bor side
- `[grpc]` loopback default). Required — heimdall's clients assume both.
- `config.toml` to opt into gRPC on the address heimdall can reach:
- (`ps aux` / shell history):
- `bor server` reads it at startup; the flag stays as a fallback.

Step 2: Heimdall side
Flip the config in `app.toml`:

Restart heimdall; the startup parity check runs one `HeaderByNumber` over both transports and logs a Warning / Fatal if the hashes diverge (guard against a stale bor that doesn't populate the full proto header).

Rollout recommendation
- `addr=127.0.0.1:3131`, token empty or set via `BOR_GRPC_TOKEN`. No TLS needed on loopback.
- `addr=0.0.0.0:3131` + `https://` URL + matching token. Heimdall refuses to send a token over plaintext to a non-loopback peer.

Downgrade path
Set `bor_grpc_flag = "false"` in heimdall and restart. Bor doesn't need to be downgraded — it keeps serving both transports regardless of the flag.

Tests
`GetBlockInfoInBatch` 4.4× faster on gRPC, 50–67% allocation reduction across all methods, `GetRootHash` regression closed via the cached API.

Dependencies
- `GetBlockInfoInBatch` / `GetTdByHash` / `GetTdByNumber` / `GetAuthor` RPCs and the expanded `Header` message.
- `[grpc]` port, thread `cl_bor_grpc_flag` + `el_grpc_url` through the CL launcher template) if you want to run the end-to-end bench on a devnet. The changes are here.

Out of scope
- `Bloom` and `Logs` are intentionally not populated yet. No heimdall consumer reads them today; will be a small follow-up when needed.
- `GetTdByNumber` tag resolution (latest/pending) via `HeaderByNumber → GetTd(hash)` is scoped to this gRPC handler; the same defect in upstream `eth/api_backend.go:GetTdByNumber` affects the JSON-RPC path and is tracked separately.